Bulletpapers - Understand complex papers in seconds

May 2024

Using language models to classify educational text difficulty

This paper introduces prompt-based metrics that leverage the language understanding capabilities of large language models to evaluate the difficulty of educational texts. Combined with traditional readability metrics, these prompt-based metrics significantly improve automatic classification of texts by education level. The metrics help measure how well language models a...

May 2024

Language models' ability to capture implied meanings

This paper investigates whether language models can capture implied discourse-level meanings, beyond just semantic features of words. The authors focus on differential object marking in Korean, where markers convey both semantic and discourse information. They evaluate several language models, finding that larger models better capture discourse meanings of a dedicated d...

May 2024

Improving machine translation by mitigating hallucination and omission

This paper proposes using word alignment as a preference signal to optimize large language model-based machine translation systems. By comparing translations from multiple systems, preferred translations with better word alignment coverage are selected. Subsequent optimization aligns the model toward producing translations resembling the preferred ones. Extensive experi...
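
As a rough illustration of the selection step described above, the sketch below scores each candidate translation by the fraction of source words that have at least one aligned target word, then keeps the highest-coverage candidate as the preferred one. The alignment format (a list of `(source_index, target_index)` pairs), the function names, the system names, and the toy data are all assumptions made for illustration; the summary does not say how the paper computes coverage.

```python
def coverage(num_source_words, alignment):
    """Fraction of source words aligned to at least one target word."""
    if num_source_words == 0:
        return 0.0
    covered = {src for src, _tgt in alignment}
    return len(covered) / num_source_words


def pick_preferred(num_source_words, candidates):
    """Return (preferred, rejected) candidate names ranked by coverage.

    `candidates` maps a hypothetical system name to its word alignment.
    """
    ranked = sorted(
        candidates,
        key=lambda name: coverage(num_source_words, candidates[name]),
        reverse=True,
    )
    return ranked[0], ranked[-1]


# Toy example: a 5-word source sentence translated by two systems.
alignments = {
    "system_a": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],  # every source word covered
    "system_b": [(0, 0), (1, 1), (4, 2)],                  # two source words left unaligned
}
preferred, rejected = pick_preferred(5, alignments)
```

A pair like `(preferred, rejected)` is the kind of signal a preference-optimization step could consume; low coverage is used here as a stand-in for omission.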

May 2024

Online iterative reinforcement learning from human feedback

This technical report provides a detailed recipe for online iterative reinforcement learning from human feedback (RLHF) to align large language models (LLMs). It uses a proxy preference model trained on diverse open datasets to approximate human feedback. The method trains an LLM called SFR-Iterative-DPO-LLaMA-3-8B-R that achieves state-of-the-art performance on LLM cha...

May 2024

Evaluating and Reducing Hallucinations in Vision-Language Models

The paper proposes THRONE, a new benchmark to evaluate 'Type I' hallucinations (in open-ended responses) in large vision-language models (LVLMs). It utilizes language models to identify hallucinations and introduces metrics to quantify them. The paper demonstrates that reducing 'Type II' hallucinations (in responses to specific questions) does not reduce Type I hallucin...

May 2024

Aligning language models for information extraction

This paper introduces ADELIE, a series of aligned language models that achieve state-of-the-art performance on various information extraction tasks. The models are trained on IEInstruct, a new high-quality dataset tailored for information extraction alignment. Experiments show ADELIE matches or exceeds prior models on closed IE, open IE and on-demand IE, with no decline...

May 2024

Repairing C/C++ Code Vulnerabilities via Node Type Analysis

This paper proposes NAVRepair, a novel framework that combines node-type information from Abstract Syntax Trees with error types to provide precise repair context and targeted fixes for C/C++ code vulnerabilities. It customizes analysis based on the Minimum Edit Node to collect relevant context. Extensive experiments show NAVRepair assists large language models in accur...

May 2024

Transferable text-image person matching

This paper studies the challenging problem of developing AI models that can match images of people to textual descriptions, and directly apply the models to new datasets without additional fine-tuning. The key ideas are: (1) use large language models to automatically generate a massive training dataset, (2) enhance diversity of the textual descriptions, and (3) identify...

May 2024

Efficient GPU serving of LLMs with 4-bit quantization

The paper introduces QoQ, an algorithm and system for serving large language models on GPUs using 4-bit quantization of weights, 8-bit quantization of activations, and 4-bit quantization of key-value caches. This allows higher throughput compared to prior work while maintaining accuracy. The key insight is reducing overhead of operations on slow CUDA cores. Progressive ...

May 2024

Robust planning from minimal text

This paper presents NL2Plan, a new system that allows users to create complete PDDL domain and problem descriptions from minimal natural language text inputs. An LLM extracts necessary information incrementally, then a classical planner solves the resulting PDDL task. This combines the language understanding strengths of LLMs with the robust planning abilities of classi...

May 2024

Language Models Improve Pose Estimation

This paper presents a method that uses large language models to refine 3D human pose estimates by generating natural language descriptions of physical contacts from images. These descriptions are converted into optimization constraints to capture semantics like hugs, hand-holding, and yoga poses. Without extra training data, the method performs comparably to more comple...

May 2024

Text classification without relying on seen classes

This paper proposes a strategy for few-shot and zero-shot text classification that does not require training on seen classes. A large language model generates pseudo samples for novel classes, with the most representative samples kept as anchors. Classification is reframed as judging similarity between query and anchors. This simplifies the task and better utilizes limi...
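
The reframing of classification as query-to-anchor similarity can be sketched as a nearest-anchor lookup. This is only a minimal illustration: the hand-written three-dimensional vectors stand in for real text embeddings, cosine similarity is an assumed choice of similarity measure, and the function names are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(query_vec, anchors):
    """Pick the label whose best anchor is most similar to the query.

    `anchors` maps each class label to a non-empty list of anchor vectors.
    """
    best_label, best_score = None, -2.0  # cosine similarity is never below -1
    for label, vecs in anchors.items():
        score = max(cosine(query_vec, v) for v in vecs)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy anchors for two novel classes (vectors are made up for illustration).
anchors = {
    "sports": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "finance": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
label, score = classify([0.85, 0.15, 0.05], anchors)
```

Because the decision depends only on the anchors, adding a new class means adding anchors for it, with no retraining on seen classes.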

May 2024

Process-free mathematical reasoning via Monte Carlo Tree Search

This paper introduces a new technique to train large language models to solve complex, multi-step math problems without requiring manual annotations of the reasoning process steps. It uses Monte Carlo Tree Search to automatically generate supervised data and step-level value signals for model training. Experiments show significant accuracy gains on challenging test sets.
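
To make the idea of step-level value signals concrete, here is a minimal sketch of two generic Monte Carlo Tree Search ingredients: UCT-style child selection and backing up a rollout reward along the visited path. The node layout (plain dictionaries), the exploration constant, and the function names are assumptions; the summary does not state which selection rule or value definition the paper uses, so this should be read as textbook MCTS rather than the authors' procedure.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """Mean value plus an exploration bonus; unvisited children come first."""
    if visits == 0:
        return math.inf
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=1.4):
    """Choose the child reasoning step with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits, c),
    )


def backup(path, reward):
    """Add a rollout reward (e.g. 1.0 if the final answer was correct) to every node on the path."""
    for node in path:
        node["visits"] += 1
        node["value_sum"] += reward


def step_value(node):
    """Average backed-up reward, usable as a step-level training target."""
    return node["value_sum"] / node["visits"] if node["visits"] else 0.0


# Two candidate next steps under a parent that has been visited 10 times.
children = [
    {"step": "A", "visits": 10, "value_sum": 7.0},
    {"step": "B", "visits": 0, "value_sum": 0.0},
]
chosen = select_child(children, parent_visits=10)
backup([chosen], reward=1.0)
```

After enough rollouts, `step_value` gives each intermediate step a score derived only from final-answer correctness, which is how such a search can avoid manual step annotations.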

May 2024

Scaling instruction data from the web to enhance language model reasoning

This paper proposes harvesting 10 million high-quality instruction-response pairs from the web to improve language models' reasoning abilities, without requiring costly human annotation or GPT-4 distillation. A 3-step pipeline recalls relevant documents, extracts instruction pairs, and refines them using open-source models. Fine-tuning on this data significantly boosts ...

May 2024

An open-source evaluator model

This paper introduces Prometheus 2, an open-source language model specialized for evaluating the quality of text generated by other language models. It demonstrates superior performance in providing scores and rankings that closely match human judgment, while also allowing flexible evaluation based on custom criteria beyond just helpfulness.

May 2024

Factual language model alignment

This paper studies how to align language models to follow instructions while reducing false claims. It finds that standard alignment methods can increase hallucination by training models on unfamiliar data or rewarding very detailed responses. The authors propose methods to make alignment more factual, by eliciting knowledge from the model itself and using separate rewa...

May 2024

Instruction tuning enables controllable text generation

This paper explores using instruction tuning of large language models as an approach to controllable text generation. The authors introduce an algorithm to automatically generate constraint datasets from only a task dataset and natural language description. They benchmark instruction-tuned models on a new testbed, ConGenBench, finding that prompting outperforms other co...

May 2024

Integrating creativity into large language models

This paper discusses ways to address a key limitation of large language and vision models: their lack of creative problem-solving abilities. It summarizes approaches from the field of computational creativity that could impart creative skills to these models, including exploratory search, conceptual combination, and transformational prompting. Preliminary experiments de...

May 2024

Verifying and refining AI explanations through logic

This paper presents a system integrating large language models and theorem provers to evaluate, formalize, and refine natural language explanations for AI reasoning tasks. The system leverages symbolic logic to verify explanation validity and provide feedback for iterative improvements.

May 2024

Automated product placement in images

This paper proposes a 3-stage system for virtually placing products into images automatically. First, it identifies good locations using language and segmentation models. Then, it uses a fine-tuned Stable Diffusion model to paint the product into the image. Finally, an Alignment Module removes low-quality images, improving average quality by 35%. This shows potential for a...

May 2024

Using language models for query expansion and re-ranking

This paper explores generative techniques to expand queries and re-rank results using large language models. Different methods are tested, including zero-shot query reformulation, pseudo-relevance feedback, and adaptive re-ranking over corpus graphs. Combining these approaches leads to performance gains, with the best run using both generative pseudo-relevance feedback ...

April 2024

Iterative training for reasoning tasks

This paper develops an iterative training approach to improve reasoning ability in language models. It focuses on chain-of-thought style reasoning. In each iteration, competing reasoning chains are generated and optimized to prefer chains leading to correct answers over incorrect ones. This is done by extending the DPO training method with an additional loss term. Over ...

April 2024

Benchmarking language models' linguistic competence

This paper introduces Holmes, a benchmark that evaluates language models' ability to understand key linguistic concepts like syntax and semantics. It reviews over 250 studies on probing language models and includes over 200 datasets covering phenomena like part-of-speech tagging and rhetorical structure. Experiments on over 50 language models show performance correlates...

April 2024

Self-speculative decoding via early exiting

This paper proposes Kangaroo, a self-speculative decoding method that accelerates inference for large language models while maintaining output quality. Kangaroo uses a fixed, shallow sub-network from the target model as a 'self-draft' model, with an additional lightweight adapter module to improve representation. To minimize drafting latency, Kangaroo also applies early...
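
The general draft-then-verify loop behind speculative decoding can be sketched as follows. This is a toy greedy version under stated assumptions: `target_next` and `draft_next` are hypothetical stand-in functions over integer tokens, not Kangaroo's shallow sub-network, adapter, or early-exit mechanism, and verification is done in a plain loop rather than one parallel forward pass.

```python
def speculative_decode(target_next, draft_next, prompt, num_new_tokens, draft_len=4):
    """Greedy draft-then-verify decoding; output matches plain greedy target decoding."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1) The cheap draft model proposes a short continuation.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2) The target model checks each drafted position in order.
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the target's own token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # every drafted token was accepted
    return tokens[:goal]


def target_next(ctx):
    """Toy 'large' model: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy 'small' model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_decode(next_token, prompt, num_new_tokens):
    """Reference decoding that calls the target once per token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


fast = speculative_decode(target_next, draft_next, [0], 12)
slow = greedy_decode(target_next, [0], 12)
```

The speed-up in a real system comes from accepting several drafted tokens per target pass; the sketch only demonstrates that the accepted output is unchanged.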

April 2024

Impact of preference alignment on trust in language models

This study investigates how alignment with general helpfulness and harmlessness preferences affects language models across five trustworthiness aspects: toxicity, bias, ethics, truthfulness, and privacy. The results show that improvement is not guaranteed: there is a complex interplay between preferences, algorithms, and trustworthiness. This underscores the need for nuanced alignment approaches to develop capable a...

April 2024

Language models represent truth values

This paper investigates whether language models contain directions in their latent spaces that correlate with the truth values of sentences. The authors evaluate probes that identify these 'belief directions', analyzing their consistency and sensitivity to contextual information. Through experiments on multiple models, they find the probes are generally context-sensitiv...

April 2024

Efficient fine-tuning of language models with federated learning

This paper proposes FeDeRA, an efficient method to fine-tune language models with federated learning. FeDeRA is based on the LoRA technique, but initializes the adapter module differently using singular value decomposition on the pre-trained weights. This provides robustness to non-IID data. Experiments on various datasets and models show FeDeRA matches or exceeds the a...

April 2024

Analyzing semantic change through word replacements

This paper proposes a schema to model semantic change, in which target words are replaced with related or random words to simulate lexical innovation. The schema helps build an interpretable semantic change model based on the contextual distance of the replacements across time. The paper also presents a first assessment of how well the generative language model LLaMA captures semantic change.

April 2024

Realistic Material Assignment for 3D Objects

This paper presents Make-it-Real, a novel approach that leverages large language models to identify materials from images and assign them to 3D objects. This allows creating realistic material properties for existing 3D assets and generated models.

April 2024

Benchmark for evaluating language models on generation tasks in Indian languages

The authors release IndicGenBench, a new benchmark to measure the ability of language models to perform generation tasks like summarization, translation, and question answering across 29 languages native to India. It extends existing datasets, providing multi-way parallel test data in many under-resourced Indic languages for the first time.

April 2024

Deleting tokenization from language models

This paper proposes SpaceByte, a novel byte-level decoder that matches the performance of subword language models without using tokenization. SpaceByte uses a byte-level Transformer, but inserts extra blocks to model words and phrases. Experiments show SpaceByte matches subword models for the same compute budget, avoiding issues like performance bias from predetermined ...

April 2024

Pre-training language models to use calculators

This paper explores pre-training objectives to teach smaller encoder-decoder language models to use calculators, formulating it as a classification task for encoder-only models and a sequence generation task for encoder-decoder models. Pre-training on math word problems shows improved performance on downstream NLP tasks requiring numerical reasoning.

April 2024

Quantizing LLaMA3 Models to Low Bitwidths

This paper evaluates quantizing the powerful LLaMA3 language models to low bitwidths from 1-8 bits using 10 different quantization methods. Results show LLaMA3 still suffers significant performance degradation at low bitwidths, highlighting a gap for future methods to improve.
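
The trend reported above, more error at lower bitwidths, can be seen even with the simplest scheme. The sketch below applies symmetric uniform round-to-nearest quantization to a handful of made-up weights and measures the reconstruction error at 8, 4, and 2 bits. It is an assumption-laden toy: the weights are invented, a single per-tensor scale is used, and the method is a generic baseline chosen for illustration, not a reproduction of any of the ten methods the paper evaluates.

```python
def quantize_dequantize(weights, bits):
    """Symmetric uniform round-to-nearest quantization followed by dequantization."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax                # one scale for the whole tensor
    restored = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        restored.append(q * scale)
    return restored


def mean_squared_error(a, b):
    """Average squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Invented weights, used only to show the trend.
weights = [0.31, -0.12, 0.05, 0.77, -0.64, 0.02, -0.29, 0.48]
errors = {
    bits: mean_squared_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4, 2)
}
```

On this toy tensor the error grows as the bitwidth shrinks; more sophisticated methods aim to narrow that gap, which is the comparison the paper carries out on LLaMA3.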

April 2024

Guiding retrieval in language models via missing information

This paper explores how language models can identify missing information needed to answer complex questions, and use that to guide an iterative process of retrieving and integrating external knowledge. Through experiments, the authors find language models adept at pinpointing gaps in their knowledge, generating focused follow-up queries, and extracting useful information from ...

April 2024

Evaluating AI Chatbots' Ability to Emulate Ordinary People

This paper introduces ECHO, a framework to assess chatbots' ability to convincingly emulate average people in conversations. It engages acquaintances of target individuals to distinguish between human and machine responses. Results show GPT-4 more effectively deceives human evaluators, with GPTs achieving a 48.3% success rate. Additionally, GPT-4 could identify differen...

April 2024

Multimodal models challenged by core visual perception

This paper introduces Blink, a benchmark evaluating key visual perception skills in multimodal language models. It finds that while humans can solve these visual tasks easily, state-of-the-art models still struggle significantly. Specialized computer vision models perform much better, suggesting potential pathways to improve multimodal models.

April 2024

Language models learn human preferences via token rewards

This paper shows language models trained with human preferences learn implicit token rewards matching human judgments. Researchers derive theory connecting preference learning to token reinforcement learning. Empirically, they show the trained model assigns token rewards corresponding to human judgment, and that search over the learned policy matches searching over a le...

April 2024

Cross-lingual alignment of language models

This paper evaluates using a reward model trained on human preferences in one language to improve language models in other languages. It shows this cross-lingual transfer is effective for summarization and dialog. Sometimes a different-language reward model works better than using preferences from the model's language.

April 2024

Detecting irony using language models and emotion analysis

This paper introduces a new irony detection method that uses large language models to expand texts with more emotional cues, then processes the expanded texts with BERT, T5 and GPT-2 models. Testing shows improved irony detection compared to baselines.

April 2024

Improving language model robustness through self-denoising

This paper proposes a self-denoising technique to enhance the robustness of large language models against adversarial attacks. By leveraging the model's own capability to fill in masked words, noisy inputs can be denoised before final predictions. This boosts performance on corrupted inputs required by randomized smoothing defenses. Experiments show the method improves ...

April 2024

Small, efficient instruction-tuned language models

This paper describes a method to efficiently fine-tune small language models to follow instructions, using synthetic instruction data generated by a larger model. The resulting 'OpenBezoar' models are aligned to human preferences using techniques like RLHF and DPO, and evaluations show performance competitive with or exceeding that of larger models on some benchmarks.

April 2024

Aligning body motion to AI-generated descriptions

This paper explores using large language models to generate detailed textual descriptions of human motions, including actions and walking patterns. The descriptions help align motion data to language for tasks like action recognition and gait analysis. Key findings show promise for improving motion understanding and connecting multi-modal data with AI.

April 2024

Automating claim detection and prioritization for fact-checking

This paper explores using large language models (LLMs) to identify claims requiring fact-checking and prioritize them based on worthiness criteria. Experiments show optimal prompt design is domain-dependent. Adding context does not improve accuracy. LLMs can produce reliable claim rankings using confidence scores.

April 2024

Stance detection using large language models

This paper evaluates different large language models for stance detection on social media. It looks at models like ChatGPT, LLaMa-2, and Mistral-7B, comparing their accuracy after fine-tuning on public datasets. The key findings show exceptional stance detection abilities for all models, with LLaMa-2 and Mistral-7B demonstrating remarkable efficiency despite smaller siz...

April 2024

Probing language models for consistent understanding

This paper evaluates whether language models exhibit consistent understanding across different forms of the same input. It tests a large language model, GPT-3.5, by asking it factual questions and commonsense reasoning problems in different languages and paraphrases. The authors find that the model often gives inconsistent responses to inputs with the same meaning, indi...

April 2024

Simulating fictional character decisions

This paper investigates whether large language models can accurately predict the key decisions made by characters in novels, when provided with the preceding storyline context. The authors construct a dataset of 1,401 decision points from 395 books, sourced from literary expert analyses, to benchmark model performance. Experiments demonstrate promising capabilities, yet...

April 2024

Enhancing language model robustness to character variation attacks

This paper proposes a method called CHANGE that integrates a Chinese character variation graph into pre-trained language models to make them more robust to adversarial attacks that manipulate characters. Through graph-based pre-training tasks, CHANGE helps models better interpret text that has been altered with character substitutions or variations, outperforming prior ...

April 2024

Token-level policy optimization for aligning language models

This paper introduces Token-level Direct Preference Optimization (TDPO), an approach to align language models with human preferences during text generation by directly optimizing policy at the token level. It improves alignment and diversity compared to methods that evaluate full generated responses.
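
For context, the sketch below computes the standard sequence-level DPO loss, which is one of the "full response" methods that a token-level approach is contrasted with. It does not reproduce TDPO's token-level objective, which the summary does not spell out. The per-token log-probabilities, the `beta` value, and the function names are illustrative assumptions.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a full response: the sum of its per-token log-probabilities."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sequence-level DPO loss for one preference pair.

    Each argument is a list of per-token log-probabilities for a response,
    scored by the policy being trained or by the frozen reference model.
    """
    chosen_ratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_ratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities for a two-token chosen and rejected response.
loss_same = dpo_loss([-0.5, -0.6], [-0.7, -0.8], [-0.5, -0.6], [-0.7, -0.8])
loss_better = dpo_loss([-0.2, -0.3], [-1.0, -1.2], [-0.5, -0.6], [-0.7, -0.8])
```

Here the per-token values are collapsed into one number per response before the comparison is made, so the loss cannot distinguish which tokens drove the preference; that is the granularity a token-level formulation changes.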

April 2024

Benchmarking language model performance across languages

This paper proposes a 'Language Ranker' method to quantitatively benchmark and rank the performance of large language models across high-resource languages like English and French as well as low-resource languages. The key idea is to use the language model's representations on an English dataset as a baseline, and measure the similarity of representations on other langu...

April 2024

Improving image retrieval with composed queries

This paper proposes methods to improve composed image retrieval models by generating more training data and introducing additional negative examples during training. It uses a large language model to automatically create more positive training examples from image datasets. It also freezes the image encoder after initial training to bring in many static negative examples...

April 2024

Self-playing game improves language model reasoning

Researchers developed a two-player adversarial game called 'Adversarial Taboo' to improve language models' reasoning skills. One player tries to get the other to unconsciously say a secret word, while the other player tries to guess the word. By having language models play against themselves in this game, their reasoning abilities on benchmarks uniformly improved throug...

April 2024

Auto-prompt graphical paradigm for language models

This paper proposes an automated prompt framework that combines emotional stimulation and structural guidance to enhance language models' problem-solving abilities across domains. The framework involves generating prompts automatically based on emotion and structure, guiding models through problem abstraction, solution generation, optimization, and self-verification. Co...

April 2024

Detecting unknown biases in text-to-image models

This paper proposes OpenBias, an automatic pipeline to detect and quantify biases in text-to-image models without needing a predefined list of biases. It has 3 stages: first, a language model proposes possible biases for a set of captions. Then, a text-to-image model generates images from those captions. Finally, a vision question answering model recognizes if those pro...

April 2024

RecurrentGemma model with efficient long sequence processing

RecurrentGemma is a 2 billion parameter language model that uses a novel Griffin architecture to enable fixed memory usage during inference, allowing more efficient processing of long text sequences. It achieves strong performance on language tasks while requiring less memory than transformer models.

April 2024

Improving consistency of AI-generated planning models

This paper presents a concept to improve consistency of planning models generated by large language models, using automated error checking during generation. This reduces corrections needed by humans. The approach is demonstrated on classical and custom planning domains like logistics and pizza cooking.

April 2024

Automating early-stage scientific research through AI assistance

This paper proposes an AI system called ResearchAgent that aims to accelerate the initial phase of scientific research. It automatically generates research ideas involving problems, methods, and experiments using large language models that are enhanced with links to related publications and relevant entities extracted from scientific literature.

April 2024

Vision-language models estimate object reflectance

This paper explores using large language models (LLMs) and vision-language models (VLMs) to estimate the infrared light reflectance of objects, which is key for robotic grasping. The authors find that LLMs like GPT-3.5 and GPT-4 can estimate reflectance from just object names, while VLMs like CLIP leverage both visual and textual knowledge for even better image-based es...

April 2024

Effect of duplicate subwords on language model efficiency

This paper investigates the impact of duplicate and near-duplicate subwords in language model vocabularies. Through controlled experiments that duplicate subwords and merge near-duplicates, the authors quantify potential improvements from better generalization across duplicates. They find duplicates hurt efficiency, costing models ~17% more data, but near-duplicates are less in...

April 2024

Text reasoning for vector graphics

This paper proposes a new method to perform precise reasoning on vector graphics images, which are composed of 2D shapes, using text-based representations. It first converts images to Scalable Vector Graphics (SVG) code to capture details. Then it learns to map SVG paths to Primal Visual Descriptions (PVD), which describe shapes using attributes like position and color....

April 2024

Efficient fine-tuning of large language models

This paper proposes an automated pipeline called FedPipe to efficiently fine-tune large language models (LLMs) in a privacy-preserving federated learning setting across heterogeneous edge servers. FedPipe identifies the most important weights to fine-tune based on their contribution to model performance. It configures specialized low-rank adapters for those weights on e...

April 2024

Benchmarking progress of LLM agents

This paper introduces AgentQuest, a modular framework to benchmark and evaluate the performance of LLM-based agents on complex reasoning tasks. It offers easy-to-use APIs to connect agents and benchmarks, and defines new metrics like progress rate and repetition rate that go beyond binary success/failure to track how agents advance step-by-step. The utility of these met...

April 2024

Evaluating causal learning capabilities of large language models

This paper proposes a comprehensive benchmark called CausalBench to evaluate how well large language models can understand causality, which is important for explaining outputs, adapting to new evidence, and generating counterfactuals. CausalBench includes causal learning tasks to compare LLMs to classic algorithms, networks of varying sizes to test capability limits, an...

April 2024

Open Indonesian language models for diverse tasks

The authors introduce Cendol, a collection of Indonesian language models for text generation. Cendol includes decoder-only and encoder-decoder models across various sizes. Experiments show Cendol models outperform existing models on Indonesian NLP tasks by 20%. Cendol also generalizes to unseen tasks and Indonesian regional languages. However, Cendol still falls behind ...

April 2024

Memory bank for long videos

This paper proposes storing past video frames in a memory bank to enable large language models to understand long videos, overcoming context length limits. It processes videos online, storing features in the memory bank, allowing historical reference without exceeding GPU memory.

April 2024

Fast personalized image generation

This paper introduces MoMA, an open-source model for fast personalized image generation using a single reference image. MoMA utilizes a multimodal language model to extract image features and text prompts, then generates new images through an image diffusion model. A novel self-attention method transfers details between images. MoMA performs well at recontextualizing su...

April 2024

Contextual tagging for named entity recognition

This paper introduces LTNER, a named entity recognition framework that uses contextualized entity marking to leverage large language models' context learning abilities, significantly improving their NER accuracy without additional training. On the CoNLL03 dataset, F1 scores increased from 85.9% to 91.9%, nearing supervised fine-tuning levels. LTNER is robust with few ex...

April 2024

Aligning speech generation to human preferences

This paper proposes SpeechAlign, an iterative strategy to align speech language models to human preferences without needing additional human-annotated data. It analyzes and addresses the distribution gap between training and inference in current models. SpeechAlign constructs a preference dataset contrasting golden versus synthetic codec tokens and conducts preference o...

April 2024

Extracting software information with language models

This paper explores using generative language models for information extraction tasks related to software entities in academic texts. It focuses on named entity recognition and relation extraction between software mentions and attributes like versions and citations. The authors employ an approach leveraging retrieval augmented generation and re-framing relation extracti...

April 2024

Evaluating LLMs' ability to reason about interventions

This paper introduces benchmarks to evaluate whether large language models can accurately update their knowledge about causal relationships after an intervention is performed. The benchmarks involve predicting how hypothetical interventions like randomized experiments would modify causal graphs across areas like mediation and confounding. Analysis of four language model...

April 2024

Manipulating language models by poisoning preference data

This paper studies how malicious actors can manipulate language model generations by injecting a small number of poisoned preference pairs (1-5% of the dataset) into publicly available preference datasets used for reinforcement learning from human feedback training. The poisoned data causes the model to frequently generate a target entity with desired positive or negati...

April 2024

Selecting Facts for Effective Bug Repair with Large Language Models

This paper investigates how to construct effective prompts for large language model-based automated program repair by selecting relevant bug facts to include. It finds each fact aids in fixing some bugs, but too many facts degrade performance. This led to defining the fact selection problem: choosing an optimal set of facts for a prompt to maximize repair. A model is de...

April 2024

Guiding language models to generate structured content

This paper proposes a method to guide large language models to generate structured content that follows specific conventions, without needing additional fine-tuning. By using coroutine-based content generation constraints defined through a context-free grammar, language models can be directed during decoding to produce outputs that comply with formal languages represent...

April 2024

Enhancing language models with reflection on search trees

This paper introduces a framework called Reflection on Search Trees (RoT) to improve language models' performance on reasoning and planning tasks that use tree search methods. RoT summarizes guidelines from past search experiences to prevent the model from repeating mistakes. A novel state selection method identifies critical information to generate meaningful, specific...

April 2024

Benchmark for long context language model understanding

This paper introduces XL2Bench, a benchmark to evaluate language models' ability to understand very long texts (100K+ words in English, 200K+ characters in Chinese). It has 3 scenarios (fiction, papers, laws) and 4 tasks (memory retrieval, detailed understanding, overall understanding, open-ended generation) across 27 subtasks. Experiments on 6 leading LLMs show performance l...

April 2024

Autonomous bug fixing and feature addition

This paper proposes an AI-based approach called AutoCodeRover to autonomously fix bugs and add features in software projects, in order to ease developers' workload. It combines large language models with iterative code search techniques to effectively understand GitHub issues and generate patches. Experiments on 300 real issues show it can resolve over 20%, outperformin...

April 2024

Enhancing Persian conversational question answering with keyword extraction and language models

This paper presents a new technique to improve conversational question answering systems for Persian. It combines contextual keyword extraction from the conversation with large language models to generate more precise, relevant, and coherent responses. Evaluations show significant improvements over baseline methods.

April 2024

Evaluating GPT-4's Ability to Identify Logical Fallacies

This paper evaluates how accurately the GPT-4 language model can identify 7 common types of logical fallacies in text. The authors tested GPT-4 against a dataset of fallacy examples, finding it was 79% accurate overall and 90% accurate when only considering valid fallacy identifications. Though imperfect, they deemed its performance adequate for using GPT-4 to spot fall...

April 2024

Distributed fine-tuning of language models

This paper proposes DLoRA, a distributed framework that enables efficient fine-tuning of large language models (LLMs) across cloud servers and user devices. DLoRA aims to address major challenges of existing LLM fine-tuning approaches, including privacy risks from sharing user data on public cloud platforms, and high computational demands that exceed user device capabil...

April 2024

Automated web navigation with large language models

This paper develops AutoWebGLM, an automated web navigation agent built on the ChatGLM3-6B language model. It can comprehend webpages and complete real-world browsing tasks by simplifying HTML content, employing curriculum learning and reinforcement learning, and fine-tuning on specific websites.

April 2024

Visualizing Thoughts to Improve Spatial Reasoning

This paper proposes Visualization-of-Thought (VoT) prompting to elicit the 'mind's eye' of large language models, enabling them to create mental images that visualize their reasoning steps in multi-hop spatial reasoning tasks. This significantly improved performance in natural language navigation, visual navigation, and 2D grid tiling tasks compared to standard promptin...

April 2024

Language models' knowledge of adjective semantics and scalar diversity

This paper probes large language models' understanding of scalar adjectives, which describe concepts like temperature and probability. It finds that they have rich knowledge of adjective semantics. However, rich semantic knowledge doesn't guarantee good reasoning about scalar diversity, that is, how likely different adjectives are to trigger conversational implicatures. Model size also doesn't full...

April 2024

Measuring Object Hallucinations in Image Captions

This paper proposes ALOHa, a new metric to measure when image captioning models incorrectly generate objects not actually present in the image, known as 'object hallucination'. ALOHa leverages language models to extract objects from captions, matches them to reference objects, and assigns scores indicating degree of hallucination. Experiments show ALOHa identifies more ...
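
The extract, match, and score steps described above can be pictured with a small sketch. This is not ALOHa itself: the object lists are assumed to have been extracted already (the language-model step is skipped), character-level similarity from `difflib` stands in for whatever similarity measure the metric actually uses, and taking the minimum as the caption-level score is an assumption of this example.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real system would compare meanings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def object_scores(
    caption_objects: List[str], reference_objects: List[str]
) -> Dict[str, float]:
    """Score each caption object by its best match among reference objects.

    A low score suggests the object may be hallucinated.
    """
    return {
        obj: max((similarity(obj, ref) for ref in reference_objects), default=0.0)
        for obj in caption_objects
    }


def caption_score(caption_objects: List[str], reference_objects: List[str]) -> float:
    """Caption-level score: the worst-matched object (assumed aggregation)."""
    scores = object_scores(caption_objects, reference_objects)
    return min(scores.values(), default=1.0)


scores = object_scores(["dog", "frisbee"], ["dog", "grass"])
print(scores)
print(caption_score(["dog", "frisbee"], ["dog", "grass"]))
```

String similarity is a weak proxy: it would score "puppy" against "dog" as a poor match, which is exactly the kind of case a meaning-based comparison is meant to handle.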

April 2024

Learning LTL Specifications from Explanations and Traces

This paper presents an approach that combines large language models and optimization methods to translate natural language explanations and system traces into formal Linear Temporal Logic (LTL) specifications. The approach aims to leverage the ability of language models to interpret explanations while using optimization to ensure consistency and correctness.

April 2024

Improving language models' ability to follow complex instructions

This paper introduces Conifer, a new dataset and training methodology to enhance large language models' capability to adhere to challenging, multi-constraint instructions. It utilizes GPT-4 to generate high-quality data, and a progressive scheme to develop models' skills through easy-to-complex examples and explicit process feedback.

April 2024

Evaluating large language models for assisting programmers

This paper introduces a new platform, RealHumanEval, to evaluate how well large language models can assist human programmers in writing code through autocomplete suggestions or chat interactions. The paper reports on a 213-person user study analyzing 6 language models. The study finds that models with better performance on existing benchmarks can increase programmer pro...

April 2024

Topic-based watermarking to identify AI text

This paper proposes a new watermarking technique to identify text generated by large language models versus humans. It embeds detectable signatures based on the text's topics, overcoming limitations in previous watermarking schemes that lacked robustness against attacks or practicality at scale. The proposed technique selects inclusion/exclusion token lists according to...

April 2024

Tuning Legal Language Models for Reasoning

This paper explores pretraining and instruction tuning methods to improve large language models on legal reasoning tasks. The authors curate a new instruction dataset of 12 million examples called LawInstruct, covering 24 languages and 17 jurisdictions worldwide. Using this dataset to fine-tune Flan-T5 models, they achieve a 16% accuracy boost on the LegalBench benchmark. However,...

April 2024

Automated distractor generation for math multiple-choice questions

This paper explores using large language models to automatically generate plausible but incorrect answer options (called "distractors") for multiple-choice questions in math. The key challenge is that good distractors should target common student errors and misconceptions. The authors test several methods, including fine-tuning models and prompting them with examples, o...

April 2024

Impact of prompt syntax on knowledge retrieval from language models

This paper introduces a new benchmark, CONPARE-LAMA, to analyze how subtle differences in prompt syntax and semantics affect the ability of language models to accurately retrieve relational knowledge. Controlled experiments reveal that clausal syntax prompts more consistently retrieve knowledge, combine supplementary information efficiently, and reduce response uncertai...

March 2024

Self-supervised image retrieval with open instructions

This paper introduces MagicLens, a series of self-supervised image retrieval models that can follow open-ended text instructions to find relevant images. The key insight is that image pairs naturally co-occurring on web pages contain diverse implicit relations beyond visual similarity. By using large language models to make those relations explicit as text instructions,...

March 2024

Interpreting Earth Surface Changes

This paper proposes an interactive agent that can provide comprehensive interpretation and analysis of changes on the Earth's surface based on satellite images. The agent integrates computer vision techniques for detecting changes with language models for describing changes.

March 2024

Detecting and evaluating watermarks in AI text generation

This paper proposes a framework to analyze the tradeoff between watermark detection performance and quality degradation when watermarking large language models. Comparative assessment is used to quantify quality loss for different watermark settings across models and tasks.

March 2024

Synthetic clinical data boosts NLP model performance

This paper explores using large language models to generate synthetic clinical text data, then using that data to improve performance on clinical natural language processing tasks. The synthetic data is created by prompting the language models with examples of real clinical text. A novel 'label correction' step is introduced to refine the synthetic data quality. Experim...

March 2024

Language models struggle with unreasonable math problems

This paper explores how large language models, which show promise in solving math problems, still tend to produce erroneous outputs when given mathematically unreasonable questions. The authors construct a benchmark of unreasonable problems to test language models' error detection abilities, finding models can identify some issues but fail to avoid hallucinating. They d...

March 2024

Instance-Aware Robot Navigation

This paper proposes a new method called Instance-aware Visual Language Map (IVLMap) to empower robots with instance-level and attribute-level semantic mapping. It fuses RGBD video data collected by the robot with natural language map indexing to construct maps that distinguish between objects of the same category. When integrated with large language models, IVLMap can t...

March 2024

Language models predict pedestrian paths

This paper proposes a new method to forecast pedestrian trajectories using language models instead of standard numerical regression techniques. It transforms trajectory data into text prompts that are fed into the language model, along with image captions describing the scene. Additional question-answering tasks guide the model to reason about social interactions. A spe...
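
To make the trajectory-to-text step above concrete, here is a minimal sketch of how observed coordinates and a scene caption might be turned into a prompt. The prompt wording, the function name `trajectory_to_prompt`, and the two-decimal formatting are hypothetical choices for this example, not the paper's template.

```python
from typing import List, Tuple


def trajectory_to_prompt(
    history: List[Tuple[float, float]], scene_caption: str, horizon: int
) -> str:
    """Render observed (x, y) positions and a scene caption as a text prompt."""
    coords = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in history)
    return (
        f"Scene: {scene_caption}\n"
        f"Observed pedestrian positions (x, y): {coords}\n"
        f"Predict the next {horizon} positions in the same format."
    )


prompt = trajectory_to_prompt(
    [(1.0, 2.0), (1.5, 2.25)], "A crowded sidewalk next to a bus stop.", 3
)
print(prompt)
```

Writing coordinates with a fixed number of decimals keeps the text consistent from one example to the next, which should make the model's textual output easier to parse back into numbers.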

March 2024

Framework for Resolving Software Bugs

The paper proposes a multi-agent framework called MAGIS to address the challenge of resolving software bugs reported in GitHub issue trackers. MAGIS uses specialized agents for planning and coding to unlock the potential of large language models in modifying code to fix bugs, outperforming these models applied directly. In experiments, MAGIS resolved almost 14% of issue...

March 2024

Large-scale language models for antibody engineering

This paper introduces IgBert and IgT5, two antibody-specific language models trained on over 2 billion unpaired and 2 million paired antibody sequences. The models outperform prior antibody and protein language models on sequence recovery, binding affinity prediction, and perplexity benchmarks. This work enables enhanced antibody engineering and design through advanced ...

March 2024

Preventing misattributions of AI chatbots' capabilities

This paper proposes enhancing the Social Transparency framework to address risks from users incorrectly attributing expertise, empathy and other human traits to large language models used in chatbots, especially in sensitive contexts like mental health. It suggests clarifying the roles and personas actually assigned by designers versus users' perceptions, to promote res...

March 2024

Two-step reranking method using language models

This paper introduces TWOLAR, a two-stage pipeline for passage reranking that distills knowledge from large language models (LLMs). It creates a diverse training dataset of 20K queries, with passages retrieved via 4 methods and reranked by an LLM. TWOLAR matches or exceeds state-of-the-art models with far fewer parameters. Ablation studies validate the method's contributions.

The history of language models